Your First Machine Learning Model

Selecting Data for Modeling

你的数据集包含的变量太多，以至于难以理解（就像上一章中的房屋列名），甚至无法很好地打印出来。你如何将这些压倒性的数据量缩减为你可以理解的内容呢？

我们将从直觉上挑选一些变量开始。后续的课程将向你展示统计技术，以自动优先考虑变量。

为了选择变量/列，我们需要看到数据集中所有列的列表。这可以通过DataFrame的columns属性来完成（下面的代码的最后一行）。

python

import pandas as pd

# 读取数据集
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# DataFrame
melbourne_data = pd.read_csv(melbourne_file_path) 
# 查看列名
melbourne_data.columns

python

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

python

# 墨尔本的数据有一些缺失值（有些房屋没有记录某些变量）。
# 我们将在稍后的教程中学习如何处理缺失值。
# 你的爱荷华州数据在你使用的列中没有缺失值。
# 所以现在我们将采取最简单的选项，从我们的数据中删除房屋。
# 现在不要太担心这个问题，尽管代码是这样的：
# dropna 用于删除缺失值（将 "na" 理解为 "not available"）,axis 参数决定了你想要删除缺失值的维度：
# axis = 0 沿着行的方向，删除任何包含至少一个缺失值的行。这是默认的设置。
# axis=1：沿着列的方向，删除任何全部为缺失值的列
melbourne_data = melbourne_data.dropna(axis=0)

在Pandas中，有多种方法可以选择数据的子集。Pandas课程会更深入地介绍这些内容，但目前我们将专注于两种方法：

点表示法（Dot Notation）：我们用它来选择“预测目标”。使用点表示法，你可以通过列名直接访问DataFrame中的列。这种方法通常用于选择你想要预测的目标变量。
使用列列表选择（column list）：我们用它来选择“特征”。当你需要选择多个列作为特征时，可以使用列名的列表来选择这些列。这在机器学习中特别常见，其中特征用于构建模型。

Selecting The Prediction Target

你可以使用**点表示法（dot-notation）**来提取一个变量。这个单独的列存储在一个Series对象中，这可以被广泛理解为只有单一列数据的DataFrame。

在Pandas中，“Series” 通常翻译为“序列”。在中文语境中，它指的是一个一维数组，可以包含任何数据类型，并且每个元素都有一个索引标签。这使得它非常适合表示数据集中的单个变量或列。

我们将使用点表示法来选择我们想要预测的列，这被称为预测目标。按照惯例，预测目标被称为 y。所以，我们需要保存墨尔本数据中房价的代码是：

python

y = melbourne_data.Price

Choosing “Features”

输入到我们模型中的列（以及稍后用于进行预测的列）被称为**“Features”(特征)**。在我们的例子中，这些将是用于确定房屋价格的列。有时，你会使用除目标列以外的所有列作为特征。其他时候，使用较少的特征可能会更好。

现在，我们将只使用一些特征来构建模型。稍后，你将看到如何迭代并比较使用不同特征构建的模型。

我们通过在括号内提供列名列表来选择多个特征。该列表中的每个项目应该是一个字符串（带引号）。

python

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

按照惯例，用于输入模型的数据被称为 X。在机器学习和统计建模中，X 通常表示特征矩阵或特征数据集，它包含了模型训练和预测时使用的所有输入变量。这些特征用于帮助模型学习如何预测目标变量 y。

python

X = melbourne_data[melbourne_features]

好的，我们将通过使用 describe 方法和 head 方法来快速回顾我们将用于预测房价的数据。
使用 describe 方法： describe 方法提供了数据的描述性统计信息，包括计数、平均值、标准差、最小值、第25百分位数、中值（第50百分位数）、第75百分位数和最大值。这有助于了解数据的分布和特征。
使用 head 方法： head 方法返回DataFrame的前几行，默认情况下是前5行。这有助于查看数据的前几行记录，以了解数据的结构和内容。
(小声bb)实际上还有tail方法，用于查看数据集的后几行（默认情况下是后5行）

python

X.describe()

Rooms	Bathroom	Landsize	Lattitude	Longtitude
count	6196.000000	6196.000000	6196.000000	6196.000000	6196.000000
mean	2.931407	1.576340	471.006940	-37.807904	144.990201
std	0.971079	0.711362	897.449881	0.075850	0.099165
min	1.000000	1.000000	0.000000	-38.164920	144.542370
25%	2.000000	1.000000	152.000000	-37.855438	144.926198
50%	3.000000	1.000000	373.000000	-37.802250	144.995800
75%	4.000000	2.000000	628.000000	-37.758200	145.052700
max	8.000000	8.000000	37000.000000	-37.457090	145.526350

python

X.head()

Rooms	Bathroom	Landsize	Lattitude	Longtitude
1	2	1.0	156.0	-37.8079	144.9934
2	3	2.0	134.0	-37.8093	144.9944
4	4	1.0	120.0	-37.8072	144.9941
6	3	2.0	245.0	-37.8024	144.9993
7	2	1.0	256.0	-37.8060	144.9954

使用这些命令来直观检查数据是数据科学家工作的重要组成部分。你经常会在数据集中发现值得进一步检查的意外情况。

Building Your Model

在数据科学工作中，使用这些命令来直观地检查数据是一个重要的部分。你经常能在数据集中发现值得进一步检查的意外情况。

你将使用scikit-learn库来创建你的模型。在编码时，这个库写作sklearn，就像你在示例代码中看到的那样。Scikit-learn无疑是最受欢迎的库，用于建模通常存储在DataFrames中的数据类型。

构建和使用模型的步骤包括：

定义（Define）：它将是什么类型的模型？决策树？还是其他类型的模型？还会指定该模型类型的一些其他参数。
拟合（Fit）：从提供的数据中捕获模式。这是建模的核心。
预测（Predict）：正如它听起来的那样。
评估（Evaluate）：确定模型预测的准确性。

以下是一个使用scikit-learn定义决策树模型并用特征和目标变量进行拟合的示例：

python

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
# 定义模型，并指定 random_state 的数值以确保每次运行结果相同
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
# 拟合模型
melbourne_model.fit(X, y)

在这个示例中：
DecisionTreeRegressor 是 scikit-learn 中用于回归问题的决策树模型。
random_state=1 确保了每次运行代码时，随机数生成器的初始状态是相同的，这有助于在不同运行中获得一致的结果，特别是在处理随机数据分割时。
X 是特征数据，y 是目标变量（例如房屋价格），模型将使用这些数据进行训练。

python

DecisionTreeRegressor(random_state=1)

许多机器学习模型在模型训练中允许一定的随机性。指定 random_state 的数值可以确保每次运行时得到相同的结果。这被认为是一种良好的实践。你可以使用任何数字，模型的质量实际上不会显著依赖于你选择的具体数值。

现在我们已经拟合了一个模型，可以用来进行预测。

在实践中，你可能想要对新上市的房屋进行预测，而不是我们已经知道价格的房屋。但我们将在训练数据的前几行上进行预测，以了解 predict 函数的工作原理。

python

print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

python

Making predictions for the following 5 houses:
Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]

Exercise：

Step 1: Specify Prediction Target

选择目标变量，即对应于销售价格的变量。将此变量保存到一个名为 y 的新变量中。你需要打印出列的列表来找到所需列的名称。

Asnwer:

python

# print the list of columns in the dataset to find the name of the prediction target
print(home_data.columns)
y = home_data['SalePrice']
# or y = home_data.SalePrice

Step 2: Create X

现在，你将创建一个名为 X 的DataFrame，用来存放预测特征。

由于你只想要原始数据中的一些列，你首先会创建一个包含你想要在 X 中包含的列名的列表。

你将在列表中使用以下列（你可以复制并粘贴整个列表以节省一些输入时间，尽管你仍然需要添加引号）：

LotArea
YearBuilt
1stFlrSF
2ndFlrSF
FullBath
BedroomAbvGr
TotRmsAbvGrd

在你创建了这个特征列表之后，使用它来创建你将用于拟合模型的DataFrame。

python

# Create the list of features below
feature_names = ["LotArea","YearBuilt","1stFlrSF","2ndFlrSF","FullBath","BedroomAbvGr","TotRmsAbvGrd"]

# Select data corresponding to features in feature_names
X = home_data[feature_names]

# Check your answer
step_2.check()

Review Data

Before building a model, take a quick look at X to verify it looks sensible

python

# Review data
# print description or statistics from X
print(X.describe)

# print the top few lines
print(X.head)

plaintext

<bound method NDFrame.describe of       LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0        8450       2003       856       854         2             3   
1        9600       1976      1262         0         2             3   
2       11250       2001       920       866         2             3   
3        9550       1915       961       756         1             3   
4       14260       2000      1145      1053         2             4   
...       ...        ...       ...       ...       ...           ...   
1455     7917       1999       953       694         2             3   
1456    13175       1978      2073         0         2             3   
1457     9042       1941      1188      1152         2             4   
1458     9717       1950      1078         0         1             2   
1459     9937       1965      1256         0         1             3   

      TotRmsAbvGrd  
0                8  
1                6  
2                6  
3                7  
4                9  
...            ...  
1455             7  
1456             7  
1457             9  
1458             5  
1459             6  

[1460 rows x 7 columns]>
<bound method NDFrame.head of       LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0        8450       2003       856       854         2             3   
1        9600       1976      1262         0         2             3   
2       11250       2001       920       866         2             3   
3        9550       1915       961       756         1             3   
4       14260       2000      1145      1053         2             4   
...       ...        ...       ...       ...       ...           ...   
1455     7917       1999       953       694         2             3   
1456    13175       1978      2073         0         2             3   
1457     9042       1941      1188      1152         2             4   
1458     9717       1950      1078         0         1             2   
1459     9937       1965      1256         0         1             3   

      TotRmsAbvGrd  
0                8  
1                6  
2                6  
3                7  
4                9  
...            ...  
1455             7  
1456             7  
1457             9  
1458             5  
1459             6  

[1460 rows x 7 columns]>

Step 3: Specify and Fit Model

Create a DecisionTreeRegressor and save it iowa_model. Ensure you’ve done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in X and y that you saved above.

python

# from _ import _
from sklearn.tree import DecisionTreeRegressor
#specify the model. 
#For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit the model
iowa_model.fit(X,y)

# Check your answer
step_3.check()

Step 4: Make Predictions

Make predictions with the model’s predict command using X as the data. Save the results to a variable called predictions

python

predictions = iowa_model.predict(X)
print(predictions)

# Check your answer
step_4.check()

好吧嘻嘻，后面是有一点偷懒了